Video DPO training error (视频DPO训练报错) #8157

Open
1 task done
zhanghang-official opened this issue May 26, 2025 · 2 comments
Reminder

  • I have read the above rules and searched the existing issues.

System Info

The training configuration is as follows:

```yaml
### model
model_name_or_path: /raid/zhanghang02/weights/MiniCPM-V-2_6
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: dpo
do_train: true
finetuning_type: lora
freeze_vision_tower: true
lora_rank: 8
lora_target: all
pref_beta: 0.1
pref_loss: sigmoid  # choices: [sigmoid (dpo), orpo, simpo]

### dataset
dataset: dpo_test_video
template: minicpm_v
cutoff_len: 256
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 1
dataloader_num_workers: 1

### output
output_dir: saves/minicpmv/lora/dpo
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 300.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
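
For context, `pref_loss: sigmoid` together with `pref_beta: 0.1` selects the standard sigmoid DPO objective. Below is a minimal illustrative sketch of that loss, not LLaMA-Factory's actual implementation; the function and argument names are made up for illustration:

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps: torch.Tensor,
                     policy_rejected_logps: torch.Tensor,
                     ref_chosen_logps: torch.Tensor,
                     ref_rejected_logps: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    # Sigmoid DPO loss:
    # -log sigmoid(beta * ((log pi_c - log ref_c) - (log pi_r - log ref_r)))
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```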

The error is as follows:
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.31s/it]
[INFO|modeling_utils.py:4888] 2025-05-26 16:46:08,068 >> All model checkpoint weights were used when initializing MiniCPMV.

[INFO|modeling_utils.py:4896] 2025-05-26 16:46:08,069 >> All the weights of MiniCPMV were initialized from the model checkpoint at /raid/zhanghang02/weights/MiniCPM-V-2_6.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MiniCPMV for predictions without further training.
[INFO|configuration_utils.py:1093] 2025-05-26 16:46:08,156 >> loading configuration file /raid/zhanghang02/weights/MiniCPM-V-2_6/generation_config.json
[INFO|configuration_utils.py:1140] 2025-05-26 16:46:08,156 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}

[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-05-26 16:46:08] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
[INFO|2025-05-26 16:46:08] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.misc:143 >> Found linear modules: q_proj,v_proj,up_proj,k_proj,o_proj,down_proj,gate_proj
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.visual:143 >> Set vision model not trainable: ['vpm'].
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.visual:143 >> Set multi model projector not trainable: resampler.
[INFO|2025-05-26 16:46:08] llamafactory.model.loader:143 >> trainable params: 20,185,088 || all params: 8,119,360,240 || trainable%: 0.2486
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:741] 2025-05-26 16:46:09,007 >> Using auto half precision backend
[INFO|trainer.py:2369] 2025-05-26 16:46:09,246 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-05-26 16:46:09,246 >> Num examples = 109
[INFO|trainer.py:2371] 2025-05-26 16:46:09,246 >> Num Epochs = 300
[INFO|trainer.py:2372] 2025-05-26 16:46:09,246 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2375] 2025-05-26 16:46:09,246 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:2376] 2025-05-26 16:46:09,246 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2377] 2025-05-26 16:46:09,246 >> Total optimization steps = 32,700
[INFO|trainer.py:2378] 2025-05-26 16:46:09,250 >> Number of trainable parameters = 20,185,088
0%| | 0/32700 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "/home/zhanghang02/anaconda3/envs/test1/bin/llamafactory-cli", line 8, in
sys.exit(main())
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/cli.py", line 115, in main
COMMAND_MAP[command]()
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 110, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 78, in _training_function
run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 80, in run_dpo
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 2171, in train
return inner_training_loop(
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 2480, in _inner_training_loop
batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 133, in get_batch_samples
return Trainer.get_batch_samples(self, *args, **kwargs)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 5153, in get_batch_samples
batch_samples += [next(epoch_iterator)]
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/accelerate/data_loader.py", line 566, in iter
current_batch = next(dataloader_iter)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 733, in next
data = self._next_data()
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data
return self._process_data(data, worker_id)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data
data.reraise()
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/_utils.py", line 750, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/collator.py", line 264, in call
return super().call(concatenated_features)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/collator.py", line 157, in call
mm_inputs = self.template.mm_plugin.get_mm_inputs(
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/mm_plugin.py", line 1080, in get_mm_inputs
image_bounds = torch.hstack(
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 3 but got size 2 for tensor number 1 in the list.

0%| | 0/32700 [00:00<?, ?it/s]
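
The failing call is the `torch.hstack` in `get_mm_inputs` of `mm_plugin.py`. `torch.hstack` concatenates 2-D tensors along dimension 1, so every tensor in the list must have the same size in dimension 0; the error therefore indicates that the image-bound pieces being stacked for this batch do not have the same number of rows. A minimal standalone sketch (illustrative only; the shapes are invented, this is not LLaMA-Factory code) that reproduces the same RuntimeError:

```python
import torch

# torch.hstack concatenates along dim 1, so all tensors must agree in dim 0.
# If, for example, 3 image-start indices are collected but only 2 image-end
# indices for the same batch, the stack fails exactly as in the traceback.
starts = torch.zeros(3, 1, dtype=torch.long)  # 3 rows
ends = torch.zeros(2, 1, dtype=torch.long)    # 2 rows -> mismatch in dim 0
torch.hstack([starts, ends])
# RuntimeError: Sizes of tensors must match except in dimension 1.
# Expected size 3 but got size 2 for tensor number 1 in the list.
```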

Reproduction

Put your message here.

Others

No response

zhanghang-official added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on May 26, 2025
zhanghang-official (Author) commented:
Could you please take a look at what the problem is? Thanks!!

hiyouga (Owner) commented on May 26, 2025:

cc @BUAADreamer
